30.12.2020 YaCy 'koopa_stembod_online': Solr Schema Editor 




Solr Schema Editor 


If you use a custom Solr schema, you may enter a different field name for a YaCy default attribute name in the column 'Custom Solr Field Name'.


Select a core: collection1 ... the core can be searched at /solr/select?core=collection1&q=*:*&start=0&rows=3
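As an illustration only, the search URL above can be assembled programmatically. The sketch below assumes a peer reachable at http://localhost:8090 (a hypothetical address; substitute your own host and port) and uses a helper name, solr_select_url, invented for this example. Only the /solr/select path and the core, q, start and rows parameters are taken from the page.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, core="collection1", query="*:*", start=0, rows=3):
    """Build a select URL for the given core (helper name is hypothetical)."""
    params = {"core": core, "q": query, "start": start, "rows": rows}
    # safe="*:" keeps the match-all query "*:*" unescaped, as on the page.
    query_string = urlencode(params, safe="*:")
    return f"{base_url.rstrip('/')}/solr/select?{query_string}"


# Assumed peer address; not part of the original page.
print(solr_select_url("http://localhost:8090"))
```

Changing rows or start pages through the result set in the usual Solr manner.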


Active  Attribute  Custom Solr Field Name  Comment  | show active | show all available | show disabled

id                              primary key of document, the URL hash **mandatory field**
sku                             url of document
last_modified                   last-modified from http header
load_date_dt                    time when resource was loaded
content_type                    mime-type of document
title                           content of title tag
host_id_s                       id of the host, a 6-byte hash that is part of the document id
host_s                          host of the url
size_i                          the size of the raw source
failreason_s                    fail reason if a page was not loaded. If the page was loaded then this field is empty
failtype_s                      fail type if a page was not loaded. This field is either empty, 'excl' or 'fail'
httpstatus_i                    HTTP status return code (e.g. "200" for ok), -1 if not loaded
url_file_ext_s                  the file name extension
host_organization_s             either the second level domain or, if a ccSLD is used, the third level domain
inboundlinks_urlstub_sxt        internal links, the url only without the protocol
inboundlinks_protocol_sxt       internal links, only the protocol
outboundlinks_protocol_sxt      external links, only the protocol
outboundlinks_urlstub_sxt       external links, the url only without the protocol
images_urlstub_sxt              all image links without the protocol and '://'
images_protocol_sxt             all image link protocols
fresh_date_dt                   date until resource shall be considered as fresh
referrer_id_s                   id of the referrer to this document, discovered during crawling
publisher_t                     the name of the publisher of the document
language_s                      the language used in the document
audiolinkscount_i               number of links to audio resources
videolinkscount_i               number of links to video resources
applinkscount_i                 number of links to application resources
title_exact_signature_l         the 64 bit hash of the org.apache.solr.update.processor.Lookup3Signature of title, used to compute title_unique_b
title_unique_b                  flag shows if title is unique within all indexable documents of the same host with status code 200; if yes and another document appears with same title, the unique-flag is set to false
exact_signature_copycount_i     counter for the number of documents which are not unique (== count of not-unique-flagged documents + 1)
fuzzy_signature_text_t          intermediate data produced in EnhancedTextProfileSignature: a list of word frequencies
fuzzy_signature_copycount_i     counter for the number of documents which are not unique (== count of not-unique-flagged documents + 1)
process_sxt                     needed (post-)processing steps on this metadata set
dates_in_content_dts            if date expressions can be found in the content, these dates are listed here as date objects in order of appearance
dates_in_content_count_i        the number of entries in dates_in_content_dts
startDates_dts                  content of itemprop attributes with content='startDate'
endDates_dts                    content of itemprop attributes with content='endDate'
references_i                    number of unique http references, should be equal to references_internal_i + references_external_i
references_internal_i           number of unique http references from same host to referenced url
references_external_i           number of unique http references from external hosts
references_exthosts_i           number of external hosts which provide http references
crawldepth_i                    crawl depth of web page according to the number of steps that the crawler did to get to this document; if the crawl was started at a root document, then this is equal to the clickdepth
harvestkey_s                    key from a harvest process (i.e. the crawl profile hash key) which is needed for near-realtime postprocessing. This shall be deleted as soon as postprocessing has been terminated.
http_unique_b                   unique-field which is true when an url appears the first time. If the same url which was http then appears as https (or vice versa) then the field is false
www_unique_b                    unique-field which is true when an url appears the first time. If the same url within the subdomain www then appears without that subdomain (or vice versa) then the field is false
exact_signature_l               the 64 bit hash of the org.apache.solr.update.processor.Lookup3Signature of text_t
exact_signature_unique_b        flag shows if exact_signature_l is unique at the time of document creation, used for double-check during search
fuzzy_signature_l               64 bit of the Lookup3Signature from EnhancedTextProfileSignature of text_t
fuzzy_signature_unique_b        flag shows if fuzzy_signature_l is unique at the time of document creation, used for double-check during search
coordinate_p                    point in degrees of latitude,longitude as declared in WGS84
coordinate_p_0_coordinate       automatically created subfield, (latitude)
coordinate_p_1_coordinate       automatically created subfield, (longitude)
ip_s                            ip of host of url (after DNS lookup)
author                          content of author-tag
author_sxt                      content of author-tag as copy-field from author. This is used for facet generation
description_txt                 content of description-tag(s)
description_exact_signature_l   the 64 bit hash of the org.apache.solr.update.processor.Lookup3Signature of description, used to compute description_unique_b
description_unique_b            flag shows if description is unique within all indexable documents of the same host with status code 200; if yes and another document appears with same description, the unique-flag is set to false
keywords                        content of keywords tag; words are separated by space
charset_s
wordcount_i                     number of words in visible area
linkscount_i                    number of all outgoing links; including linksnofollowcount_i
linksnofollowcount_i            number of all outgoing links with nofollow tag
inboundlinkscount_i             number of outgoing inbound (to same domain) links; including inboundlinksnofollowcount_i
inboundlinksnofollowcount_i     number of outgoing inbound (to same domain) links with nofollow tag
outboundlinkscount_i            number of outgoing outbound (to other domain) links, including outboundlinksnofollowcount_i
outboundlinksnofollowcount_i    number of outgoing outbound (to other domain) links with nofollow tag
imagescount_i                   number of images
responsetime_i                  response time of target server in milliseconds
text_t                          all visible text
synonyms_sxt                    additional synonyms to the words in the text
h1_txt
h2_txt
h3_txt
h4_txt
h5_txt
h6_txt
md5_s                           the md5 of the raw source
httpstatus_redirect_s           redirect url if the error code is 299 < httpstatus_i < 310
collection_sxt                  tags that are attached to crawls/index generation to separate the search result into user-defined subsets
csscount_i                      number of entries in css_tag_sxt and css_url_sxt
css_tag_sxt                     full css tag with normalized url
css_url_sxt                     normalized urls within a css tag
scripts_sxt                     normalized urls within a scripts tag
scriptscount_i                  number of entries in scripts_sxt
robots_i                        content of <meta name="robots" content=#content#> tag and the "X-Robots-Tag" HTTP property
metagenerator_t                 content of <meta name="generator" content=#content#> tag
inboundlinks_anchortext_txt     internal links, the visible anchor text
outboundlinks_anchortext_txt    external links, the visible anchor text
icons_urlstub_sxt               all icon links without the protocol and '://'
icons_protocol_sxt              all icon links protocols
icons_rel_sxt                   all icon links relationships space separated (e.g. 'icon apple-touch-icon')
icons_sizes_sxt                 all icon sizes space separated (e.g. '16x16 32x32')
images_text_t                   all text/words appearing in image alt texts or the tokenized url
images_alt_sxt                  all image link alt tag
images_height_val               size of images:height
images_width_val                size of images:width
images_pixel_val                size of images as number of pixels (easier for a search restriction than width and height)
images_withalt_i                number of image links with alt tag
htags_i                         binary pattern for the existence of h1..h6 headlines
canonical_s                     url inside the canonical link element
canonical_equal_sku_b           flag shows if the url in canonical_s is equal to sku
refresh_s                       link from the url property inside the refresh link element
li_txt                          all texts in <li> tags
licount_i                       number of <li> tags
dt_txt                          all texts in <dt> tags
dtcount_i                       number of <dt> tags
dd_txt                          all texts in <dd> tags
ddcount_i                       number of <dd> tags
article_txt                     all texts in <article> tags
articlecount_i                  number of <article> tags
bold_txt                        all texts inside of <b> or <strong> tags. no doubles. listed in the order of number of occurrences in decreasing order
boldcount_i                     total number of occurrences of <b> or <strong>
italic_txt                      all texts inside of <i> tags. no doubles. listed in the order of number of occurrences in decreasing order
italiccount_i                   total number of occurrences of <i>
underline_txt                   all texts inside of <u> tags. no doubles. listed in the order of number of occurrences in decreasing order
underlinecount_i                total number of occurrences of <u>
flash_b                         flag that shows if a swf file is linked
frames_sxt                      list of all links to frames
framesscount_i                  number of frames_sxt
iframes_sxt                     list of all links to iframes
iframesscount_i                 number of iframes_sxt
hreflang_url_sxt                url of the hreflang link tag, see http://support.google.com/webmasters/bin/answer.py?hl=de&answer=189077
hreflang_cc_sxt                 country code of the hreflang link tag, see http://support.google.com/webmasters/bin/answer.py?hl=de&answer=189077
navigation_url_sxt              page navigation url, see http://googlewebmastercentral.blogspot.de/2011/09/pagination-with-relnext-and-relprev.html
navigation_type_sxt             page navigation rel property value, can contain one of {top,up,next,prev,first,last}
publisher_url_s                 publisher url as defined in http://support.google.com/plus/answer/1713826?hl=de
url_protocol_s                  the protocol of the url
url_file_name_s                 the file name (which is the string after the last '/' and before the query part from '?' on) without the file extension
url_file_name_tokens_t          tokens generated from url_file_name_s which can be used for better matching and result boosting
url_paths_count_i               number of all path elements in the url hpath (see: http://www.ietf.org/rfc/rfc1738.txt) without the file name
url_paths_sxt                   all path elements in the url hpath (see: http://www.ietf.org/rfc/rfc1738.txt) without the file name
url_parameter_i                 number of key-value pairs in search part of the url
url_parameter_key_sxt           the keys from key-value pairs in the search part of the url
url_parameter_value_sxt         the values from key-value pairs in the search part of the url
url_chars_i                     number of all characters in the url == length of sku field
host_dnc_s                      the Domain Class Name, either the TLD or a combination of ccSLD+TLD if a ccSLD is used.
host_organizationdnc_s          the organization and dnc concatenated with '.'
host_subdomain_s                the remaining part of the host without organizationdnc
host_extent_i                   number of documents from the same host; can be used to measure references_internal_i for likelihood computation
title_count_i                   number of titles (counting the 'title' field) in the document
title_chars_val                 number of characters for each title
title_words_val                 number of words in each title
description_count_i             number of descriptions in the document. It does not count the 'description' field since there is only one, but it counts the number of descriptions that appear in the document (if any)
description_chars_val           number of characters for each description
description_words_val           number of words in each description
h1_i                            number of h1 header lines
h2_i                            number of h2 header lines
h3_i                            number of h3 header lines
h4_i                            number of h4 header lines
h5_i                            number of h5 header lines
h6_i                            number of h6 header lines
schema_org_breadcrumb_i         number of itemprop="breadcrumb" appearances in div tags
opengraph_title_t               Open Graph Metadata from og:title metadata field, see http://ogp.me/ns#
opengraph_type_s                Open Graph Metadata from og:type metadata field, see http://ogp.me/ns#
opengraph_url_s                 Open Graph Metadata from og:url metadata field, see http://ogp.me/ns#
opengraph_image_s               Open Graph Metadata from og:image metadata field, see http://ogp.me/ns#
cr_host_count_i                 the number of documents within a single host
cr_host_chance_d                the chance to click on this page when randomly clicking on links within one host
cr_host_norm_i                  normalization of chance: 0 for lower half of cr_host_count_i urls, 1 for 1/2 of the remaining and so on. The maximum number is 10
rating_i                        custom rating; to be set with external rating information
bold_val                        number of occurrences of texts in bold_txt
italic_val                      number of occurrences of texts in italic_txt
underline_val                   number of occurrences of texts in underline_txt
ext_cms_txt                     names of cms attributes; if several are recognized then they are listed in decreasing order of number of matching criteria
ext_cms_val                     number of attributes that count for a specific cms in ext_cms_txt
ext_ads_txt                     names of ad-servers/ad-services
ext_ads_val                     number of attribute counts in ext_ads_txt
ext_community_txt               names of recognized community functions
ext_community_val               number of attribute counts in ext_community_txt
ext_maps_txt                    names of map services
ext_maps_val                    number of attribute counts in ext_maps_txt
ext_tracker_txt                 names of tracker server
ext_tracker_val                 number of attribute counts in ext_tracker_txt
ext_title_txt                   names matching title expressions
ext_title_val                   number of matching title expressions
vocabularies_sxt                collection of all vocabulary names that have a matcher in the document - use this to boost with vocabularies
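
The comment for cr_host_norm_i describes a halving scheme. One possible reading, offered as a sketch and not as YaCy's actual implementation, sorts the documents of a host by ascending cr_host_chance_d and assigns 0 to the lower half, 1 to the next quarter, and so on, capped at 10. The function name cr_host_norm and the 0-based rank argument are assumptions made for this example.

```python
def cr_host_norm(rank, host_count, max_norm=10):
    """Return the assumed normalization level for one document.

    rank       -- 0-based position when the host's documents are sorted by
                  ascending click chance (an assumption of this sketch)
    host_count -- number of documents on the host (cr_host_count_i)
    """
    if not 0 <= rank < host_count:
        raise ValueError("rank must satisfy 0 <= rank < host_count")
    at_or_above = host_count - rank  # documents ranked at or above this one
    level = 0
    # Each doubling that still fits into host_count means the document
    # survived one more "upper half of the remaining urls" cut.
    while level < max_norm and at_or_above * 2 <= host_count:
        at_or_above *= 2
        level += 1
    return level


# Eight documents on one host, lowest chance first.
print([cr_host_norm(r, 8) for r in range(8)])  # [0, 0, 0, 0, 1, 1, 2, 3]
```

Under this reading only very large hosts reach the maximum of 10, since each level halves the remaining set of urls.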


reset selection to default 
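
To make the url_* attributes in the table more concrete, the following sketch derives comparable values from a URL with Python's standard library. It follows the wording of the comments (file name without extension, path elements without the file name, key-value pairs of the search part) but is not YaCy's own parser; the function name url_fields and the example URL are hypothetical.

```python
import posixpath
from urllib.parse import parse_qsl, urlsplit


def url_fields(url):
    """Split a URL into values named after the url_* schema attributes."""
    parts = urlsplit(url)
    # Separate the directory part of the path from the file part.
    directory, file_part = posixpath.split(parts.path)
    file_name, extension = posixpath.splitext(file_part)
    path_elements = [p for p in directory.split("/") if p]
    query_pairs = parse_qsl(parts.query, keep_blank_values=True)
    return {
        "url_protocol_s": parts.scheme,
        "url_file_name_s": file_name,
        "url_file_ext_s": extension.lstrip("."),
        "url_paths_sxt": path_elements,
        "url_paths_count_i": len(path_elements),
        "url_parameter_i": len(query_pairs),
        "url_parameter_key_sxt": [key for key, _ in query_pairs],
        "url_parameter_value_sxt": [value for _, value in query_pairs],
        "url_chars_i": len(url),
    }


# Hypothetical example URL.
for name, value in url_fields(
    "https://www.example.org/docs/api/index.html?lang=en&page=2"
).items():
    print(name, value)
```

Edge cases such as URLs without a file name or with several dots in the file name may be handled differently by YaCy.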


Reindex documents 


If you have deselected some fields, old documents in the index still contain those fields. To physically remove them from the index, you need to reindex the documents. Here you can reindex all documents with inactive fields.
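
Before reindexing, it may help to check whether any documents still carry a deselected field. The sketch below builds a select URL using the standard Solr range query field:[* TO *], which matches documents that have any value in the field, with rows=0 so that only the hit count is of interest. It assumes that the peer's /solr/select endpoint accepts this query and that the peer runs at http://localhost:8090; the address, the helper name and the chosen field are illustrative only.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def field_present_query_url(base_url, field, core="collection1"):
    """Build a URL that matches documents still containing `field`."""
    params = {"core": core, "q": f"{field}:[* TO *]", "start": 0, "rows": 0}
    return f"{base_url.rstrip('/')}/solr/select?{urlencode(params)}"


# Assumed peer address and an arbitrary field from the table.
url = field_present_query_url("http://localhost:8090", "images_text_t")
print(url)
# Decode the query again to show what the server would receive.
print(parse_qs(urlsplit(url).query)["q"][0])
```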